Add non-power-of-2 shapes for Morton coding to benchmarks by mkitti · Pull Request #3717 · zarr-developers/zarr-python

mkitti · 2026-02-20T21:30:17Z

tests: Add non-power-of-2 shard shapes to benchmarks
tests: Add near-miss power-of-2 shape (33

[Description of PR]

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/user-guide/*.md
Changes documented as a new file in changes/
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mkitti · 2026-02-20T21:43:35Z

Benchmark Results

These benchmarks were run on this branch (which includes the vectorized get_chunk_slice from #3713) to characterize Morton order performance across power-of-2 and non-power-of-2 shard shapes.

`test_morton_order_iter` — pure Morton computation, no I/O, LRU cache cleared each round

| Shape | Elements | Type | Mean time |
|---|---|---|---|
| `(8,8,8)` | 512 | power-of-2 | 0.45 ms |
| `(16,16,16)` | 4,096 | power-of-2 | 3.6 ms |
| `(32,32,32)` | 32,768 | power-of-2 | 28.9 ms |
| `(10,10,10)` | 1,000 | non-power-of-2 | 9.6 ms |
| `(20,20,20)` | 8,000 | non-power-of-2 | 88.2 ms |
| `(30,30,30)` | 27,000 | non-power-of-2 | 125.6 ms |
| `(33,33,33)` | 35,937 | near-miss (+1 above 32³) | 767 ms |

The near-miss penalty is striking: (33,33,33) has only ~10% more elements than (32,32,32) but takes 27× longer. This is because the current floor-hypercube approach must scalar-decode many Morton codes beyond the guaranteed in-bounds region.
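The arithmetic behind the near-miss jump can be checked with a short sketch. The helper names (`axis_bits`, `morton_counts`) are hypothetical and not part of zarr-python, and the per-axis rounding is an assumption chosen because it reproduces the figures quoted in this PR for cubic shapes (`n_z` of 32,768 versus 262,144).

```python
import math


def axis_bits(n: int) -> int:
    """Bits needed to index an axis of length n (n >= 1)."""
    return (n - 1).bit_length()


def morton_counts(shape: tuple[int, ...]) -> tuple[int, int, int]:
    """Return (n_total, n_z, n_floor) for a shard shape.

    n_total: number of valid coordinates in the shape
    n_z:     size of the Morton code space of the enclosing power-of-2 box
    n_floor: size of the largest power-of-2 box that fits inside the shape
             (assumed per-axis rounding; illustrative only)
    """
    n_total = math.prod(shape)
    n_z = 1 << sum(axis_bits(s) for s in shape)
    n_floor = math.prod(1 << (s.bit_length() - 1) for s in shape)
    return n_total, n_z, n_floor


for shape in [(30, 30, 30), (32, 32, 32), (33, 33, 33)]:
    n_total, n_z, n_floor = morton_counts(shape)
    # ratio is the overgeneration factor; n_z - n_floor is the number of
    # codes left for a scalar loop after the floor hypercube is handled.
    print(shape, n_total, n_z, round(n_z / n_total, 2), n_z - n_floor)
```

For `(33,33,33)` each axis needs 6 bits instead of 5, so the code space grows by a factor of 8 while the number of valid coordinates grows by only about 10%.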

`test_sharded_morton_write_single_chunk` — write 1 chunk to a large shard, cache cleared each round

| Shape | Chunks/shard | Mean time |
|---|---|---|
| `(32,32,32)` | 32,768 | 35.7 ms |
| `(30,30,30)` | 27,000 | 127.5 ms |
| `(33,33,33)` | 35,937 | 767.8 ms |

`test_sharded_morton_single_chunk` — read 1 chunk from a large shard (cached after first access)

| Shape | Mean time |
|---|---|
| `(32,32,32)` | 0.73 ms |
| `(30,30,30)` | 0.69 ms |
| `(33,33,33)` | 0.71 ms |

Reads are fast across all shapes once the Morton order cache is warm (the first call pays the penalty, subsequent reads are cached).
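The cold-versus-warm distinction can be illustrated with `functools.lru_cache`. This is a minimal sketch, not the zarr-python code: `chunk_order` is a hypothetical stand-in for an expensive ordering computation, and the cache size is arbitrary.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=16)
def chunk_order(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    """Hypothetical stand-in for an expensive per-shape ordering."""
    return tuple(product(*(range(s) for s in shape)))


chunk_order((4, 4, 4))         # cold call: computes and stores the result
chunk_order((4, 4, 4))         # warm call: served from the cache
print(chunk_order.cache_info())

chunk_order.cache_clear()      # what a "cache cleared each round" benchmark does
chunk_order((4, 4, 4))         # cold again: the full cost is paid once more
print(chunk_order.cache_info())
```

A benchmark that clears the cache every round measures the cold path, while one that leaves it warm measures only the lookup, which is consistent with the read timings above being nearly identical across shapes.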

Interpretation

The benchmarks confirm that non-power-of-2 shard shapes carry a significant Morton computation penalty under the current implementation, with near-miss shapes (like (33,33,33)) being especially slow. These benchmarks provide a baseline to measure improvements from follow-on optimization work.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ort strategy (#3718)

* tests: Add non-power-of-2 shard shapes to benchmarks

  Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests: Add near-miss power-of-2 shape (33,33,33) to benchmarks

  Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: Apply ruff format to benchmark file

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* changes: Add changelog entry for PR #3717

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy

  For shapes just above a power-of-2 (e.g. (33,33,33)), the ceiling-only approach generates n_z=262,144 Morton codes for only 35,937 valid coordinates (7.3× overgeneration). The floor+scalar approach is even worse since the scalar loop iterates n_z-n_floor times (229,376 for (33,33,33)), not n_total-n_floor.

  The fix: when n_z > 4*n_total, use an argsort strategy that enumerates all n_total valid coordinates via meshgrid, encodes each to a Morton code using vectorized bit manipulation, then sorts by Morton code. This avoids the large overgeneration while remaining fully vectorized.

  Result for test_morton_order_iter:

  (30,30,30): 24ms (ceiling, ratio=1.21)
  (32,32,32): 28ms (ceiling, ratio=1.00)
  (33,33,33): 32ms (argsort, ratio=7.3 → fixed from ~820ms with scalar)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Address pre-commit CI failures in _morton_order

  - Replace Unicode multiplication sign × with ASCII x in comment (RUF003)
  - Add explicit type annotation for np.argsort result to satisfy mypy

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Cast argsort result via np.asarray to resolve mypy no-any-return

  np.stack returns Any in mypy's view, so indexing into it also returns Any. Using np.asarray(..., dtype=np.intp) makes the type explicit and avoids the no-any-return error at the return site.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Pre-declare order type to resolve mypy no-any-return in _morton_order

  np.asarray and np.stack return Any with numpy 2.1 type stubs, causing mypy to infer the return type as Any. Pre-declaring order as npt.NDArray[np.intp] before the if/else makes the intended type explicit.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
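The argsort strategy described in the commit message above can be sketched as follows. This is an illustration under stated assumptions, not the actual `_morton_order` code from #3718: the function names are hypothetical, the bit layout (axis 0 takes the lowest interleaved bit, every axis padded to the same bit width) is assumed, and only the `n_z > 4*n_total` threshold is taken from the source.

```python
import math

import numpy as np


def morton_encode(coords: np.ndarray, bits: int) -> np.ndarray:
    """Interleave the low `bits` bits of each column of an (N, ndim) array.

    Assumed layout: bit b of axis d lands at position b * ndim + d.
    """
    coords = coords.astype(np.int64)
    ndim = coords.shape[1]
    codes = np.zeros(coords.shape[0], dtype=np.int64)
    for bit in range(bits):
        for dim in range(ndim):
            codes |= ((coords[:, dim] >> bit) & 1) << (bit * ndim + dim)
    return codes


def morton_order_argsort(shape: tuple[int, ...]) -> np.ndarray:
    """Enumerate only the valid coordinates, then sort them by Morton code."""
    bits = max((s - 1).bit_length() for s in shape)
    grids = np.meshgrid(*(np.arange(s) for s in shape), indexing="ij")
    coords = np.stack([g.ravel() for g in grids], axis=1)
    codes = morton_encode(coords, bits)
    return coords[np.argsort(codes, kind="stable")]


def choose_strategy(shape: tuple[int, ...]) -> str:
    """Threshold from the commit message: argsort when n_z > 4 * n_total."""
    n_total = math.prod(shape)
    n_z = 1 << sum((s - 1).bit_length() for s in shape)
    return "argsort" if n_z > 4 * n_total else "ceiling"


for shape in [(30, 30, 30), (32, 32, 32), (33, 33, 33)]:
    print(shape, choose_strategy(shape))

print(morton_order_argsort((2, 2)).tolist())
```

The work here scales with `n_total` (plus a sort) rather than with `n_z`, which is why a shape like `(33,33,33)` no longer pays for the 262,144-entry code space.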

mkitti and others added 2 commits February 20, 2026 16:27

tests: Add near-miss power-of-2 shape (33,33,33) to benchmarks

1dfd71d

Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions bot added the `needs release notes` label (automatically applied to PRs which haven't added release notes) Feb 20, 2026

mkitti and others added 2 commits February 20, 2026 16:48

style: Apply ruff format to benchmark file

403c50b

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

changes: Add changelog entry for PR zarr-developers#3717

ffa3065

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions bot removed the `needs release notes` label (automatically applied to PRs which haven't added release notes) Feb 21, 2026

mkitti mentioned this pull request Feb 21, 2026

perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy #3718

Merged

6 tasks

Merge branch 'main' into mkitti-morton-benchmarks

8eeb56b

d-v-b approved these changes Feb 24, 2026


d-v-b enabled auto-merge (squash) February 24, 2026 13:41

d-v-b merged commit 32c7ab9 into zarr-developers:main Feb 24, 2026
25 checks passed

d-v-b merged 5 commits into zarr-developers:main from mkitti:mkitti-morton-benchmarks
